Goto

Collaborating Authors

 stochastic gradient method



Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Neural Information Processing Systems

The amount of data available in the world is growing faster than our ability to deal with it. However, if we take advantage of the internal structure, data may become much smaller for machine learning purposes. In this paper we focus on one of the fundamental machine learning tasks, empirical risk minimization (ERM), and provide faster algorithms with the help from the clustering structure of the data. We introduce a simple notion of raw clustering that can be efficiently computed from the data, and propose two algorithms based on clustering information. Our accelerated algorithm ClusterACDM is built on a novel Haar transformation applied to the dual space of the ERM problem, and our variance-reduction based algorithm ClusterSVRG introduces a new gradient estimator using clustering. Our algorithms outperform their classical counterparts ACDM and SVRG respectively.


Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences

Neural Information Processing Systems

We develop efficient solution methods for a robust empirical risk minimization problem designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance. Our methods apply to distributionally robust optimization problems proposed by Ben-Tal et al., which put more weight on observations inducing high loss via a worst-case approach over a non-parametric uncertainty set on the underlying data distribution. Our algorithm solves the resulting minimax problems with nearly the same computational cost of stochastic gradient descent through the use of several carefully designed data structures. For a sample of size n, the per-iteration cost of our method scales as O(logn), which allows us to give optimality certificates that distributionally robust optimization provides at little extra cost compared to empirical risk minimization and stochastic gradient methods.


Gradient Methods for Submodular Maximization

Neural Information Processing Systems

In this paper, we study the problem of maximizing continuous submodular functions that naturally arise in many learning applications such as those involving utility functions in active learning and sensing, matrix approximations and network inference. Despite the apparent lack of convexity in such functions, we prove that stochastic projected gradient methods can provide strong approximation guarantees for maximizing continuous submodular functions with convex constraints. More specifically, we prove that for monotone continuous DR-submodular functions, all fixed points of projected gradient ascent provide a factor $1/2$ approximation to the global maxima. We also study stochastic gradient methods and show that after $\mathcal{O}(1/\epsilon^2)$ iterations these methods reach solutions which achieve in expectation objective values exceeding $(\frac{\text{OPT}}{2}-\epsilon)$. An immediate application of our results is to maximize submodular functions that are defined stochastically, i.e. the submodular function is defined as an expectation over a family of submodular functions with an unknown distribution. We will show how stochastic gradient methods are naturally well-suited for this setting, leading to a factor $1/2$ approximation when the function is monotone. In particular, it allows us to approximately maximize discrete, monotone submodular optimization problems via projected gradient ascent on a continuous relaxation, directly connecting the discrete and continuous domains. Finally, experiments on real data demonstrate that our projected gradient methods consistently achieve the best utility compared to other continuous baselines while remaining competitive in terms of computational effort.


Optimal Learning for Multi-pass Stochastic Gradient Methods

Neural Information Processing Systems

We analyze the learning properties of the stochastic gradient method when multiple passes over the data and mini-batches are allowed. In particular, we consider the square loss and show that for a universal step-size choice, the number of passes acts as a regularization parameter, and optimal finite sample bounds can be achieved by early-stopping. Moreover, we show that larger step-sizes are allowed when considering mini-batches. Our analysis is based on a unifying approach, encompassing both batch and stochastic gradient methods as special cases.





2 Frameworkandassumptions 2.1 Stochasticoptimizationundertimedrift ThroughoutSections2-4,weconsiderthesequenceofstochasticoptimizationproblems min

Neural Information Processing Systems

Our results concisely explain the interplay between the learning rate, the noise variance in the gradient oracle, and the strength ofthetime drift. The high-probability results merely assume that thegradient noise and time drift have light tails. Moreover, none of the results require the objectives to have bounded domains.


2 Frameworkandassumptions 2.1 Stochasticoptimizationundertimedrift Weconsiderthesequenceofstochasticoptimizationproblems min

Neural Information Processing Systems

Our results concisely explain the interplay between the learning rate, the noise variance in the gradient oracle, and the strength ofthetime drift. The high-probability results merely assume that thegradient noise and time drift have light tails. Moreover, none of the results require the objectives to have bounded domains.